Abstract

The paper revives an older approach to acoustic modeling that borrows from n-gram language 
modeling in an attempt to scale up both the amount of training data and model size (as measured 
by the number of parameters in the model), to approximately 100 times larger than current sizes used in 
automatic speech recognition. In such a data-rich setting, we can expand the phonetic context significantly 
beyond triphones, as well as increase the number of Gaussian mixture components for the context- 
dependent states that allow it. We have experimented with contexts that span seven or more context- 
independent phones, and up to 620 mixture components per state. Dealing with unseen phonetic contexts 
is accomplished using the familiar back-off technique used in language modeling due to implementation 
simplicity. The back-off acoustic model is estimated, stored and served using MapReduce distributed 
computing infrastructure. 

Speech recognition experiments are carried out in an N-best list rescoring framework for Google 
Voice Search. Training big models on large amounts of data proves to be an effective way to increase 
the accuracy of a state-of-the-art automatic speech recognition system. We use 87 000 hours of training 
data (speech along with transcription) obtained by filtering utterances in Voice Search logs on automatic 
speech recognition confidence. Models ranging in size between 20 and 40 million Gaussians are estimated
using maximum likelihood training. They achieve relative reductions in word-error-rate of 11% and 6% 
when combined with first-pass models trained using maximum likelihood, and boosted maximum mutual 
information, respectively. Increasing the context size beyond five phones (quinphones) does not help. 
I. Introduction

As web-centric computing has grown over the last decade, there has been an explosion in the amount of 
data available for training acoustic and language models for speech recognition. Machine translation [1]
and language modeling for Google Voice Search [2] have shown that using more training data is quite
beneficial for improving the performance of statistical language models. The same holds true in many other
applications as highlighted in [3]. Of equal importance is the observation that the increase in training data
amount should be paired with an increase in the model size. This is the situation in language modeling, 
where word n-grams are the core features of the model and more training data leads to more parameters 
in the model. We propose a similar approach for automatic speech recognition (ASR) acoustic modeling 
that is conceptually simpler than established techniques, but more aggressive in this respect. 

As a first step, it is worth asking how many training samples are needed to estimate a Gaussian well.
Appendix A provides an answer for unidimensional data: under the assumption that the n i.i.d. samples
are drawn from a normal distribution of unknown mean and variance, we can place an upper bound on the
probability that the sample mean $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ is more than $q \cdot \sigma$ away from the actual
mean $\mu$, for q small. For example, $P(|\bar{X} - \mu| > 0.06 \cdot \sigma) < 0.06$ when n = 983; similar values are
obtained for the sample variance estimate.
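
As a rough numerical check of the example above, the following Python sketch evaluates the two-sided tail probability of the sample mean. It makes the simplifying assumption that the standard deviation is known, whereas Appendix A treats the unknown-variance case, so the value need not match the bound exactly. The function name `tail_prob` is ours, introduced only for illustration.

```python
import math

def tail_prob(n, q):
    """P(|sample mean - mu| > q * sigma) for n i.i.d. normal samples.

    Assumes sigma is known: the sample mean is then normal with standard
    deviation sigma / sqrt(n), so the event is a standard normal variable
    exceeding q * sqrt(n) in magnitude.
    """
    z = q * math.sqrt(n)
    return math.erfc(z / math.sqrt(2.0))

p = tail_prob(983, 0.06)
print(f"n = 983, q = 0.06: tail probability ~ {p:.3f}")
```

Under this assumption the probability comes out close to the 0.06 quoted above, and it shrinks as n grows.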

Typical amounts of training data used for the acoustic model (AM) in ASR vary from 100 to 1000
hours. The frame rate in most systems is 100 Hz (corresponding to advancing the analysis window in
10-millisecond steps), which means that about 360 million samples are used to train the 0.5 million or so
Gaussians in a common state-of-the-art ASR system. Assuming that n = 1000 frames are sufficient for
robustly estimating a single Gaussian, 1000 hours of speech would allow for training about 0.36
million Gaussians. This figure is quite close to values encountered in ASR practice, see Section IV-B
or Table VI in [4]. We can thus say that current AMs achieve estimation efficiency: the training data is
fully utilized for robust estimation of model parameters. 
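
The arithmetic behind this estimate can be written out as a short Python sketch. The constants are the ones stated in the text (100 frames per second, n = 1000 frames per Gaussian); the function names are ours.

```python
FRAME_RATE_HZ = 100          # one frame every 10 ms, as stated in the text
FRAMES_PER_GAUSSIAN = 1000   # n assumed sufficient for one Gaussian

def frames_in(hours):
    """Number of feature frames in the given amount of speech."""
    return int(hours * 3600 * FRAME_RATE_HZ)

def supported_gaussians(hours, frames_per_gaussian=FRAMES_PER_GAUSSIAN):
    """Gaussians that can be estimated if each one needs a fixed frame budget."""
    return frames_in(hours) // frames_per_gaussian

print(frames_in(1000))            # 360000000
print(supported_gaussians(1000))  # 360000
```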

Recent applications have led to availability of data far beyond that commonly used in ASR systems. 
Filtering utterances logged by the Google Voice Search service at an adequate ASR confidence threshold
(see [5] for an overview of various confidence measures for ASR) guarantees transcriptions that are close
to human annotator performance; for example, we can obtain 87 000 hours of automatically transcribed speech at
a confidence level of 0.8 or higher in the accuracy of the transcription. If we are to strive for estimation
efficiency, then this much speech data would allow training of AMs whose size is about 40 million 
Gaussians. From a modeling point of view the question becomes: what is the best way to "invest" these 
parameters to meaningfully model the speech signal? 

The most common technique for dealing with data sparsity when estimating context-dependent output 
distributions for HMM states is the well-known decision-tree (DT) clustering approach [6]. To make 
sure the clustered states have enough data for reliable estimation, the algorithm guarantees a minimum 
number of frames at each context-dependent state (leaf of the DT). The data at each leaf is modeled 
by a Gaussian mixture model (GMM). At the other end of the spectrum, states for which there is a lot 
more training data should have more mixture components. There is a vast amount of literature on such 
model selection techniques, see [7] for a recent approach, as well as an overview. [8] shows that an
effective way of sizing the GMM output distribution in HMMs as a function of the amount of training 
data (number of frames n) is the log-linear rule: 

$\log(\text{num. components}) = \log(\beta) + \alpha \cdot \log(n)$   (1)

We take the view that we should estimate as many Gaussian components as the data allows for a given
state, according to the robustness considerations in Appendix A. In practice one enforces both lower and
upper thresholds on the number of frames for a given GMM (see Section IV-C for actual values used in
our experiments), and thus the parameters $\alpha$ and $\beta$ in (1) are set such that the output distributions for
states are estimated reliably across the full range of the data availability spectrum.
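
A minimal Python sketch of rule (1) combined with the lower and upper frame thresholds is given below. The values $\alpha$ = 0.3, $\beta$ = 2.2, 18k and 256k are the ones quoted for the ML baseline in Section IV-B; rounding to the nearest integer and the function names are our assumptions, since the text does not specify them.

```python
import math

def num_components(n_frames, alpha, beta):
    """Log-linear rule (1): log(components) = log(beta) + alpha * log(n)."""
    return beta * math.exp(alpha * math.log(n_frames))

def sized_gmm(n_frames, alpha, beta, n_min, n_max):
    """Component count for a state, or None when it has too little data."""
    if n_frames < n_min:
        return None
    n = min(n_frames, n_max)   # frames beyond n_max are subsampled anyway
    # Rounding to nearest is an assumption made for this sketch.
    return max(1, round(num_components(n, alpha, beta)))

for n in (18_000, 256_000, 1_000_000):
    print(n, sized_gmm(n, alpha=0.3, beta=2.2, n_min=18_000, n_max=256_000))
```

With these settings the sketch gives 42 components at the lower threshold and 92 at the upper one, in line with the figures quoted in Section IV-B.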

As a first direction towards increasing the model size when using larger amounts of training data, we 
choose to use longer phonetic context than the traditional triphones or quinphones: the phonetic context 
for an HMM state is determined by M context-independent (CI) phones to the left and right of the 
current phone and state. We experiment with values for M = 1, . . . , 3, thus reaching the equivalent of 
7-phones. For such large values of M not all M-phones (context dependent HMM states in our model), 
are encountered in the training data. At test time we deal with such unseen M-phones by backing-off, 
similar to what is done in n-gram language modeling: the context for an unseen M-phone encountered 
on test data is decreased gradually until we reach an M-phone that we have already observed in training. 

The next section describes our approach to increasing the state space using back-off acoustic modeling,
and contrasts it with prior work. Section III describes the back-off acoustic model (BAM) implementation
using Google's distributed infrastructure, primarily MapReduce [9] and SSTable (immutable persistent
B-tree, see [10]), along similar lines to their use in large scale language modeling for statistical machine
translation [1]. Section IV presents our experiments in an N-best list rescoring framework, followed by
conclusions. The current paper is a more comprehensive description of the experiments reported in [11].

II. Back-off N-grams for Acoustic Modeling 

Consider a short utterance whose transcription is: W = <S> action </S>, and assume the 
pronunciation lexicon provides the following mapping to CI phones sil ae k sh ih n sil. We 
use <S>, </S> to denote sentence boundaries, both pronounced as long silence sil.

A triphone approach would model the three states of ih as sh-ih+n_{1,2,3} using the DT
clustering algorithm for tying parameters across various instances *-ih+*_{1,2,3}, respectively. This
yields the so-called context-dependent states in the HMM. 

In contrast, BAM with M = 3 extracts the following training data instances (including back-off) for 
the first HMM state of the ih instance in the example utterance above: 

ih_1 / ae k sh _ n sil    frames

ih_1 / k sh _ n sil    frames

ih_1 / sh _ n    frames

There are other possible back-off strategies, but we currently implement only the one above: 

• if the M-phone is symmetric (same left and right context length), then back-off at both ends 

• if not, then back-off from the longer end until the M-phone becomes symmetric, and proceed with 
symmetric back-offs from there on. 

To achieve this we first compute the context-dependent state-level Viterbi alignment between transcription
W and speech feature frames using the transducer composition H ∘ C ∘ L ∘ W, where L, C, H denote
respectively the pronunciation lexicon, context dependency tree, and HMM-to-state FST transducers [12].
From the alignment we then extract M-phones along with the corresponding sequence of speech feature
frames. Each M-phone is uniquely identified by its key, e.g. ih_1 / ae k sh _ n sil. The
key is a string representation obtained by joining on / the central CI-state, i.e. ih_1 above, and the
surrounding phonetic context, in this case ae k sh _ n sil; _ is a placeholder marking the
position where the central CI-state ih_1 occurs in the context. Besides the maximal order M-phones,
we also collect back-off M-phones as outlined above. With each back-off we clone the frames from
the maximal order M-phone to the back-off one. We found it useful to augment the phonetic context
with word boundary information. The word boundary has its own symbol, and occupies its own context
position.
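
The key construction and the back-off strategy described above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the implementation used in the experiments: word boundary symbols are omitted, the input is a plain list of CI phones, and the function name `mphone_keys` is ours.

```python
def mphone_keys(phones, idx, state, M):
    """Keys for the M-phone centred on phones[idx], maximal order first.

    Back-off rule from the text: trim both ends when the context is
    symmetric, otherwise trim the longer end until it becomes symmetric.
    """
    left = list(phones[max(0, idx - M):idx])     # outermost phone first
    right = list(phones[idx + 1:idx + 1 + M])    # nearest phone first
    center = f"{phones[idx]}_{state}"
    keys = []
    while True:
        keys.append(f"{center} / {' '.join(left + ['_'] + right)}")
        if len(left) <= 1 and len(right) <= 1:
            break                                # reached the triphone
        if len(left) > len(right):
            left = left[1:]
        elif len(right) > len(left):
            right = right[:-1]
        else:
            left, right = left[1:], right[:-1]
    return keys

phones = "sil ae k sh ih n sil".split()
for key in mphone_keys(phones, idx=4, state=1, M=3):
    print(key)
```

For the example utterance this prints the three keys listed earlier for the first state of ih, from the maximal order M-phone down to the triphone.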
All M-phone instances encountered in the training data are aggregated using MapReduce. For each
M-phone that meets a threshold on the minimum number of frames aligned against itself, we estimate
a GMM using the standard splitting algorithm [13], following the rule in (1) to size the GMM. The
M-phones that have more frames than an upper threshold on the maximum number of frames (256k in our
experiments)¹ are estimated using reservoir sampling [14]. Variances that become too low are floored to
a small value (0.00001 in our experiments).

A. Comparison with Existing Approaches and Scalability Considerations 

BAM can be viewed as a simplified version of DT state clustering that uses question sets consisting 
of atomic CI phones, queried in a pre-defined order. This very likely makes BAM sub-optimal relative 
to standard DT modeling, yet we prefer it due to ease of implementation in MapReduce. 

The approach is not novel: [15] proposes a very similar strategy where the probability assigned to 
a frame by a triphone GMM is interpolated with probabilities assigned by left, right diphone GMMs, 
and CI phone GMMs, respectively. However, the modeling approach in BAM is not identical to [15] 
either: the former does indeed back-off in that it uses only the maximum order M-phone found in the 
model, whereas the latter interpolates up the back-off tree, and allows asymmetric back-offs. It is of 
course perfectly feasible to conceive BAM variants that come closer to the approach in [15] by using
interpolation between M-phones at various orders. 

Scalability reasons make the current BAM implementation an easier first attempt when using very large 
amounts of training data: a BAM with M = 5 estimated on 87 000 hours of training data leads to roughly 
2.5 billion (2489054034) 11-phone types. DT building requires as sufficient statistics the single-mixture 
Gaussians for M-phones sharing the same central CI phone and state. Assuming uniformity across central 
phone and state identity, we divide the total number of M-phones by the number of phones (40) times 
the number of states/phone (3) to arrive at about 25 million different M-phones that share a given central 
state and phone. Storing a single-mixture Gaussian for each M-phone requires approximately 320 bytes
(39 · 4 · 2). Under the uniformity assumption above, the training data for each DT amounts to about
25 million · 320 bytes = 8 GB of storage. It is more realistic to assume that some central CI phones will have ten
times more M-phones than the average, leading to a memory footprint of 80 GB, which starts to become 
problematic although still feasible (perhaps by employing sampling techniques, or reducing the context 
size M). 

¹For convenience, we use the "k" shorthand to denote thousands, e.g. we write 256k instead of 256000; a value of 41 898 799
is rounded to 41 899k.



February 6, 2013 



DRAFT 



JOURNAL OF IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 5 

To avoid such scalability issues, we resort to MapReduce and streaming the data for M-phones sharing
a given central triphone to the same Reducer, one maximal order M-phone at a time, as described in
Section III. M-phones at lower orders 1 ... M − 1 are estimated by accumulating the data arriving at the
Reducer into buffers of fixed capacity, using reservoir sampling [14] to guarantee a random sample of
fixed size of the input data stream. With careful sorting of the M-phone stream (see Section III), M − 1
reservoirs are sufficient for buffering data on the Reducer until all the data for a given M-phone arrives, 
the final GMM for a given M-phone is estimated and output, and the respective buffer is flushed. The 
reservoir size thus controls the memory usage very effectively. For example, when using 256k as the 
maximum number of frames for estimating a given GMM (equal to the maximum reservoir size), only 
160 MB of RAM are sufficient for building a BAM with M = 5. 
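
The 160 MB figure can be reproduced with a few lines of Python, assuming 39-dimensional feature vectors (the dimensionality given in Section IV-B) stored as 4-byte floats; the storage format is our assumption.

```python
def reservoir_bytes(max_frames, dims, bytes_per_value=4):
    """Memory for one full reservoir of feature frames (4-byte floats assumed)."""
    return max_frames * dims * bytes_per_value

M = 5
per_reservoir = reservoir_bytes(256_000, 39)
total = (M - 1) * per_reservoir    # M - 1 reservoirs are kept on the Reducer
print(per_reservoir / 1e6, "MB per reservoir,", total / 1e6, "MB in total")
```

Each reservoir takes roughly 40 MB, so the four reservoirs needed for M = 5 add up to about 160 MB.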

Our approach to obtaining large amounts of training data is very similar to that adopted in [4]. Table
VI there highlights the gains from using increasing amounts of training data from 375 hours to 2210 
hours, and shows that past 1350 hours a system with 9k states and about 300k Gaussians gets diminishing 
returns in accuracy. Our modeling approach and its implementation using MapReduce allows both the 
use of significantly more training data and estimation of much larger models: in our experiments we used 
87 000 hours of training data and built models of up to 1.1 million states and 40 million Gaussians. 

III. Distributed Acoustic Modeling 

BAM estimation and run-time are implemented using MapReduce and SSTable, and draw heavily from 
the large language modeling approach for statistical machine translation described in [1].



A. BAM Estimation Using MapReduce 

Our implementation is guided by the large scale n-gram language model estimation work of [16].
MapReduce is a framework for parallel processing across huge datasets using a large number of machines. 
The computation is split in two phases: a Map phase, and a Reduce one. The input data is assumed to be 
a large collection of key-value pairs residing on disk, and stored in a distributed file system. MapReduce 
divides it up into chunks. Each such chunk is processed by a Map worker called a Mapper, running on 
a single machine, and whose entire lifetime is dedicated to processing one such data chunk. 

Mappers are stateless, and for each input key-value pair in a given chunk they output one or more new 
key-value pairs; the computation of the new value, and the assignment of the new output key are left 
to the user code implementing a Mapper instance. The entire key space of the Map output is disjointly 
partitioned according to a sharding function: for any key value output by Map, we can identify exactly
one Reduce shard. 

The key-value pairs output by all Mapper instances are routed to their corresponding Reduce shards 
by a Shuffler, using the sharding function mentioned above. The Shuffler also collates all values for a 
given key, and presents the tuple of values along with the key to the Reducer (a Reduce worker), as one 
of the many inputs for a given Reduce shard. It also sorts the keys for a given shard in lexicographic 
order, which is the order in which they are presented to the Reducer. Each Reduce worker processes 
all the key-value pairs for a given Reduce shard: the Reducer receives a key along with all the values 
associated with it as output by all the Mappers, and collected by the Shuffler; for a given input key, the 
Reducer processes all values associated with it, and outputs one single key-value pair. It is worth noting 
that the Reducer cannot change the identity of the key it receives as input. The output from the Reduce 
phase is stored in an SSTable: an immutable persistent B-tree² associative array of key-value pairs (both
key and value are strings), that is partitioned according to the same sharding function as the one used
by the MapReduce that produced it. Another SSTable feature is that it can be used as a distributed
in-memory key-value serving system (SSTable service) with S servers (machines), each holding a partition
containing 1/S of the total amount of data. This allows us to serve models larger than what would fit
into the memory of a single machine. 
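
To make the data flow concrete, the following Python sketch simulates the Map, Shuffle and Reduce phases in memory on a toy word-count task. It illustrates the contract described above (stateless Mappers, a sharding function, lexicographically sorted keys, one output pair per key) and is not Google's MapReduce API; all names are ours.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer, shard_fn, num_shards):
    # Map phase: each input pair may emit any number of new key-value pairs.
    shards = [defaultdict(list) for _ in range(num_shards)]
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            # Shuffle: route by key and collate all values for the same key.
            shards[shard_fn(out_key) % num_shards][out_key].append(out_value)
    # Reduce phase: keys reach each Reducer in lexicographic order, and each
    # key yields exactly one output pair under that same key.
    output = []
    for shard in shards:
        table = {k: reducer(k, shard[k]) for k in sorted(shard)}
        output.append(table)       # stand-in for one SSTable partition
    return output

def word_mapper(_, line):
    return [(word, 1) for word in line.split()]

def count_reducer(_, values):
    return sum(values)

def first_char_shard(key):         # deterministic toy sharding function
    return ord(key[0])

tables = run_mapreduce([(0, "a b a"), (1, "b c")],
                       word_mapper, count_reducer, first_char_shard, 2)
print(tables)
```

Each dictionary in the output plays the role of one partition of the resulting SSTable, keyed by the same sharding function that routed the Map output.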

Fig. 1 describes the training MapReduce, explained in more detail in the following two subsections.

1) Mapper: Each Mapper instance processes a chunk of the input data, one record at a time. Each 
record consists of a key-value pair; the value stores the waveform, the word level transcript for the 
utterance, and other elements. For each record arriving at the Mapper we: 

• generate the context-dependent state-level Viterbi alignment by finding the least cost path through
the state space of the FST H ∘ C ∘ L ∘ W using the first-pass AM

• extract maximal order M-phones along with speech frames, and output (M-phone key, frames) pairs 

• compute back-off M-phones and output (M-phone key, empty) pairs. 

We note that in order to avoid sending too much data to the Reducer, we do not copy the frames to the 
back-off M-phones, which would lead to replicating the input data M times. To make sure that the data 
needed for estimating back-off M-phones is present at a given Reducer we resort to a few tricks: 

• the sharding function takes as argument the central triphone. This guarantees that all M-phones 
sharing a given central triphone (sh-ih+n in our example), are handled by the same Reducer 



²A format similar to SSTable has been open-sourced as part of the LevelDB project,
http://code.google.com/p/leveldb/
[Figure 1: data-flow diagram. Chunked input data (the utterances <S> action </S>, <S> fashion </S> and
<S> faction </S>, each with its frames) is processed by Mappers that generate the alignment and emit
M-phones. For "action" the alignment is sil ae k sh ih n sil and the Mapper emits ih_1 / ae k sh _ n sil ~
with frames_A, plus the back-offs ih_1 / k sh _ n sil and ih_1 / sh _ n without frames; "fashion" yields
ih_1 / f ae sh _ n sil ~ with frames_B and "faction" yields ih_1 / ae k sh _ n sil ~ with frames_C, each
with its back-offs. Shuffling sends the M-phones to their Reduce shard, as determined by the partitioning
key shard(ih_1 / sh _ n), and the M-phone stream arriving at a given Reduce shard is sorted in
lexicographic order. The Reducer for partition shard(ih_1 / sh _ n) maintains a stack of nested M-phones
in reverse order along with frame reservoirs; when a new M-phone arrives it pops the top entry, estimates
the GMM and outputs the (M-phone, GMM) pair. The SSTable output stores BAM as a distributed
associative array.]

Fig. 1. MapReduce Estimation for Back-Off Acoustic Model



• the M-phones need to arrive at the Reducer in a certain order, since only the maximal order M-phones
carry speech frame data. The sorting of the M-phone stream needs to be such that any given maximal
order M-phone arrives at the Reducer before all of its back-off M-phones; this allows us to buffer the
frames for all the back-off M-phones down to the central triphone state. We accomplish this by relying
on the implicit lexicographic sorting of the keys, and re-keying each M-phone before outputting it
such that the context CI phones are listed in proximity order to the central one; missing context CI
phones (due to utterance boundaries) are represented using ~ to ensure correct sorting. For example,
ih_1 / ae k sh _ n sil is actually keyed as ih_1 / sh _ n k sil ae ~, to guarantee
that M-phones sharing the central triphone ih_1 / sh _ n are processed in order of longest
to shortest context at the Reducer processing the partition partition(ih_1 / sh _ n).
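
One way to realize this re-keying is sketched below in Python. The proximity ordering and the use of ~ for missing context follow the text; padding every back-off key with ~ up to the full context length is our assumption about how plain lexicographic sorting can place longer contexts first (~ sorts after all ASCII letters). The function name `rekey` is ours.

```python
def rekey(center, left, right, M, pad="~"):
    """Proximity-ordered key; `left` is outermost-first, `right` nearest-first."""
    near_left = list(reversed(left)) + [pad] * (M - len(left))
    near_right = list(right) + [pad] * (M - len(right))
    ctx = [near_left[0], "_", near_right[0]]
    for i in range(1, M):
        ctx += [near_left[i], near_right[i]]
    return f"{center} / {' '.join(ctx)}"

M = 3
stream = [
    rekey("ih_1", ["sh"], ["n"], M),                        # triphone back-off
    rekey("ih_1", ["k", "sh"], ["n", "sil"], M),            # order-2 back-off
    rekey("ih_1", ["ae", "k", "sh"], ["n", "sil"], M),      # maximal order
]
for key in sorted(stream):
    print(key)
```

Sorting the three keys puts the maximal order M-phone first and the central triphone last, which is the arrival order the Reducer relies on.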

2) Reducer: After shuffling, each M-phone has its frame data (if carrying any) collated and presented
to the Reducer along with the M-phone key. Since the Reducer cannot change the key of the input,
it needs to output the GMM for an M-phone when it arrives at the Reducer. The sorting described in
Section III-A1 guarantees that the M-phones sharing the same central triphone arrive in the correct order
(high to low M-phone order). Every time a maximal order M-phone arrives at the Reducer we estimate a
GMM from its data (assuming the number of frames is above the lower threshold), and also accumulate
its data in the reservoirs for all of its back-off M-phones, which are buffered in a "last-in first-out" stack.

Reservoir sampling is a family of randomized algorithms for randomly choosing K samples from a list
L containing n items, where n is either a very large or unknown number. Our implementation populates
the reservoir with the first K samples to arrive at the Reducer. If more samples arrive after that, we draw
a random index r in the range [0, current sample index − 1]; if r < K we replace the sample at index r
in the reservoir with the newly arrived one, and otherwise we ignore it. Every time a back-off M-phone
arrives at the Reducer, it is guaranteed to be the same one as the M-phone at the top of the stack due to
the sorting of the M-phone stream done by the Shuffler. We then:

• add to the reservoir at the top of the stack any frames that arrived at the Reducer with the current 
M-phone; 

• pop the M-phone and the corresponding reservoir from the top of the stack; 

• estimate the GMM for this back-off M-phone if the accumulated frames exceed the lower threshold 
on the minimum number of frames, or discard the M-phone and its data otherwise; 

• output the pair (M-phone key, GMM). 
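
The reservoir sampling step used for the frame buffers can be sketched in Python as follows. The class name, the fixed random seed and the use of integers in place of feature frames are ours, chosen only to keep the example self-contained.

```python
import random

class Reservoir:
    """Fixed-capacity uniform random sample of a stream of unknown length."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self.rng = rng or random.Random(0)

    def add(self, item):
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(item)        # fill phase: keep the first K
            return
        r = self.rng.randrange(self.seen)    # uniform over [0, seen - 1]
        if r < self.capacity:
            self.samples[r] = item           # replace with probability K / seen

res = Reservoir(capacity=4)
for frame in range(1000):
    res.add(frame)
print(res.seen, len(res.samples))
```

The memory used never exceeds the reservoir capacity, regardless of how many frames stream through.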

Due to the particular sorting of the M-phone stream, the Reducer is guaranteed to have seen all the 
frame data for an M-phone when the GMM estimation takes place. The resulting SSTable stores the 
BAM as a distributed (partitioned) associative array (M-phone key, GMM). 
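
The stack-based Reducer logic can be illustrated with the toy Python sketch below. It rests on several assumptions that are ours rather than the paper's: keys are re-keyed in proximity order and padded with ~ to full length, only maximal order M-phones carry frames, buffers are plain lists instead of fixed-capacity reservoirs, and the "GMM" is replaced by a frame count.

```python
def backoff_keys(key, M, pad="~"):
    """Back-off keys of a re-keyed, padded M-phone key, lowest order first."""
    center, ctx = key.split(" / ")
    toks = [t for t in ctx.split() if t != "_"]
    keys = []
    for order in range(1, M):
        kept = toks[:2 * order] + [pad] * (2 * (M - order))
        keys.append(f"{center} / {kept[0]} _ {' '.join(kept[1:])}")
    return keys

def reduce_shard(stream, M, min_frames=1):
    stack = []    # entries [back-off key, buffered frames]; top is the last one
    model = {}
    for key, frames in stream:               # stream is sorted by key
        if stack and stack[-1][0] == key:    # back-off M-phone: pop its buffer
            _, buffered = stack.pop()
            data = buffered + list(frames)
        else:                                # maximal order M-phone
            data = list(frames)
            for depth, bkey in enumerate(backoff_keys(key, M)):
                if depth < len(stack) and stack[depth][0] == bkey:
                    stack[depth][1].extend(data)
                else:
                    stack.append([bkey, list(data)])
        if len(data) >= min_frames:
            model[key] = len(data)           # stand-in for GMM estimation
    return model

stream = sorted([
    ("ih_1 / sh _ n k sil ae ~", ["a1", "a2", "c1"]),
    ("ih_1 / sh _ n k sil ~ ~", []),
    ("ih_1 / sh _ n ae sil f ~", ["b1", "b2"]),
    ("ih_1 / sh _ n ae sil ~ ~", []),
    ("ih_1 / sh _ n ~ ~ ~ ~", []),
])
model = reduce_shard(stream, M=3)
for key, count in model.items():
    print(key, count)
```

In this toy run the central triphone ends up with all five frames, while the stack never holds more than M − 1 = 2 buffers at a time.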
B. BAM Test Run-time Using SSTable Service

At test time we rescore N-best lists for each utterance using BAM. We load the model into an SSTable 
service with S servers, each holding 1/S-th of the data. For each hypothesis in the N-best list, we:

• generate the context-dependent state-level Viterbi alignment after composing H ∘ C ∘ L with the
transcript W from the first-pass; the alignment is generated using the first-pass AM and saved with 
the hypothesis 

• extract maximal order M-phones 

• compute back-off M-phones 

• add all M-phones to a pool initialized once per input record (utterance). 

Once the pool is finalized, it is sent as a batch request to the SSTable service. The M-phones that 
are actually stored in the model are returned to the Mapper, and are used to rescore the alignment for 
each of the hypotheses in the N-best list. For each segment in the alignment we use the highest order 
M-phone that was retrieved from the BAM SSTable. If no back-off M-phones are retrieved for a given 
segment, we back-off to the first-pass AM score for that segment which is computed during the Viterbi 
alignment. 

To penalize the use of lower order M-phones, the score for a segment with an M-phone of lower
order o (o > 0) than the maximum one M incurs a per-frame back-off cost. The order of an asymmetric
M-phone is computed as the maximum of the left and right context lengths. The per-frame back-off
cost reaches its maximum value when the model backs off all the way to using the first-pass AM (DT
clustered state), o = 0. To formalize, assume that we are using a GMM with Q components for modeling
M-phone s, and that the order of s is o(s), computed as described above. The log-likelihood assigned to
a frame y aligned against state s will be:

$\log P(y \mid s) = \log \left( \sum_{q=1}^{Q} m_q \cdot P(y \mid \mu_{s,q}, \Sigma_{s,q}) \right) - f_{bo} \cdot (M - o(s))$   (2)

where $f_{bo} \geq 0$ is the per-frame back-off cost, and $m_q$ are the mixture weights for each component of
the GMM for state s: $P(y \mid \mu_{s,q}, \Sigma_{s,q})$.
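
The per-frame score in (2) can be written as a short Python sketch for diagonal covariance Gaussians, which is the covariance type used throughout the paper. The function names and the toy single-component example are ours.

```python
import math

def log_gauss_diag(y, mean, var):
    """Log density of a diagonal covariance Gaussian at frame y."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (a - m) ** 2 / v)
               for a, m, v in zip(y, mean, var))

def bam_frame_score(y, weights, means, variances, order, M, f_bo):
    """GMM log-likelihood minus the back-off cost f_bo * (M - order)."""
    logs = [math.log(w) + log_gauss_diag(y, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)                       # log-sum-exp for numerical stability
    log_mix = top + math.log(sum(math.exp(l - top) for l in logs))
    return log_mix - f_bo * (M - order)

# Toy case: one unit-variance component, frame at the mean, order 1 of M = 3.
score = bam_frame_score([0.0], [1.0], [[0.0]], [[1.0]], order=1, M=3, f_bo=0.2)
print(score)
```

An M-phone at the maximal order (order equal to M) incurs no penalty, and the penalty grows linearly as the model backs off to shorter contexts.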

The final score for each hypothesis W, log P(W, A, V), is computed by log-linear interpolation
between the first-pass AM and that obtained from the second pass one (BAM, or first-pass AM if running
sanity checks, see Table I), followed by the usual log-linear combination between AM and language
model (LM) scores:

$\log P_{AM}(A \mid W, V) = \lambda \cdot \log P_{first\ pass}(A \mid W, V) + (1.0 - \lambda) \cdot \log P_{2nd\ pass}(A \mid W, V) - \log(Z_1)$   (3)

$\log P(W, A, V) = 1/w_{LM} \cdot \log P_{AM}(A \mid W, V) + \log P_{LM}(W) - \log(Z_2)$   (4)

where A denotes the acoustic features, W denotes the word sequence in an N-best hypothesis, $w_{LM}$
is the language model weight, P(W, A, V) is the probability assigned to the word sequence W and
the corresponding acoustic features A by using V as a state-level alignment, and $\log(Z_1)$, $\log(Z_2)$ are
normalization terms ignored in rescoring; both first-pass AM and BAM pair states with frames using the
same state-level Viterbi alignment V computed using the first-pass AM.
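
A minimal Python sketch of the score combination in (3) and (4), with the normalization terms dropped as in rescoring, is shown below. The hypothesis scores are made-up toy numbers and the function names are ours.

```python
def hypothesis_score(log_p_first, log_p_second, log_p_lm, lam, w_lm):
    """Combine first-pass AM, second-pass AM and LM log-scores."""
    log_p_am = lam * log_p_first + (1.0 - lam) * log_p_second   # eq. (3)
    return log_p_am / w_lm + log_p_lm                           # eq. (4)

def rescore(nbest, lam, w_lm):
    """Return the hypothesis with the highest combined score."""
    return max(nbest, key=lambda h: hypothesis_score(
        h["first"], h["second"], h["lm"], lam, w_lm))

# Hypothetical 2-best list; the log-scores are illustrative only.
nbest = [
    {"words": "action", "first": -100.0, "second": -90.0, "lm": -10.0},
    {"words": "faction", "first": -98.0, "second": -120.0, "lm": -12.0},
]
print(rescore(nbest, lam=0.6, w_lm=17)["words"])
```

Setting lam to 1.0 reduces the combination to the first-pass score, and setting it to 0.0 uses the second-pass AM alone.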

IV. Experiments 

We ran our experiments on Google Voice Search training and test data. The subsections below detail 
the training and test setup, as well as the baseline acoustic models and their performance. 



A. Task Description 

There are two training sets that we used in our experiments: 

• maximum likelihood (ML) baseline: 1 million manually transcribed Voice Search spoken queries, 
consisting of 1300 hours of speech (468 887 097 frames); 

• filtered logs: 110 million Voice Search spoken queries along with 1-best ASR transcript, filtered
by confidence at 0.8 threshold, consisting of 87 000 hours of speech (31 530 373 291 frames). The
query-level confidence score used for filtering training data transcriptions is derived using standard
lattice-based word posteriors. The best baseline AM available, namely the boosted maximum mutual
information (bMMI) baseline AM trained as we describe in Section IV-B, is used for generating both
transcriptions and confidence scores.

As development and test data we used two sets of manually transcribed data that do not overlap with the 
training data (the utterances originate from non-overlapping time periods in our logs). Let's denote them 
as data sets DEV and TEST, consisting of 27 273 and 26 722 spoken queries (87 360 and 84 918 words),
respectively. All query data used in the experiments (training, development and test), is anonymized. 
As a first attempt at evaluating BAM, we carry out N-best list rescoring experiments with N = 10.
While 10-best may seem small, such N-best lists have approximately 7% oracle word-error-rate (WER)³
on our development data set, starting from 15% WER baseline. Also, as shown in Section IV-G, about
80% of the test set achieves 0% oracle WER at 10-best, so there is plenty of room for improvement
when doing 10-best list rescoring. In addition to this, very large LM rescoring experiments for the same
task, e.g. [2], have shown that 10-best list rescoring was very close to full lattice rescoring.

³The oracle WER measures the WER along the hypothesis in the N-best list that is closest in string-edit distance to the
transcription for that utterance.

B. First Pass Acoustic Models

The feature extraction front-end is common across all experiments:

• the speech signal is sampled at 8 kHz, and quantized linearly on 16 bits

• 13-dimensional perceptual linear predictive (PLP) coefficients [17] are extracted every 10 ms using a
raised cosine analysis window of size 25 ms, consisting of 200 samples zero-padded to 512 samples

• 9 consecutive PLP frames around the current one are then stacked to form a 117-dimensional vector

• a joint transformation estimated using linear discriminant analysis (LDA) followed by semi-tied
covariance (STC) modeling [18] reduces the feature vector down to 39 dimensions in a way that
minimizes the loss from modeling the data with a diagonal covariance Gaussian distribution.

Since BAM uses ML estimation, we decided to use two baseline AMs in our experiments: an ML
baseline AM that matches BAM training, and a discriminative (bMMI) baseline AM which produces the
best available results on our development and test data. All models use diagonal covariance Gaussians.

The ML AM used in the first-pass is estimated on the ML baseline data in the usual staged approach:

1) three-state, CI phone HMMs with output distributions consisting of single Gaussian, diagonal
covariance

2) standard DT clustering for triphones, producing 8k context-dependent states

3) GMM splitting, resulting in a model with 330k Gaussians:

• the minimum number of frames $N_{min}$ for a given context-dependent state is 18k, enforced
during DT building;

• the maximum number of frames $N_{max}$ for a given context-dependent state is 256k; GMMs
for states with more than the maximum number of frames are estimated by random sampling
down to 256k frames

• varmix estimation is used to determine the number of mixtures according to the amount of
training data, as in (1) with $\alpha$ = 0.3, $\beta$ = 2.2; this amounts to 42 components when the
number of frames n is at its minimum value of 18k, and 92 mixture components when it is at
its maximum value of 256k.

The bMMI baseline AM is obtained by running an additional discriminative training stage on significantly
more training data than the ML baseline:

4) bMMI training [19] on the ML baseline data augmented with 10 million Voice Search spoken
queries (approximately 8000 hours) and 1-best ASR transcript, filtered by confidence.

Training and test are matched with respect to the first-pass AM used: experiments reporting development
and test data results using the ML baseline AM use a BAM trained on alignments generated using the
same ML baseline AM; likewise, when switching to the bMMI baseline AM we use it to generate training,
development and test data alignments.

C. N-best List Rescoring Experiments using ML Baseline AM 

The development data is used to optimize the following parameters for BAMs trained on the ML 
baseline data, as well as 1%, 10% and 100% of the filtered logs data, respectively: 

• model order M = 1, 2, 3 (triphones to 7-phones), 

• acoustic model weight in log-linear mixing of first-pass AM scores with the rescoring AM, (3): 
λ = 0.0, 0.2, 0.4, ..., 1.0, 

• language model weight, (4): w_LM = 7, 12, ..., 22, 

• per-frame back-off weight, (2): f_bo = 0.0, 0.2, ..., 1.0. 
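
As a rough illustration of the size of this search space, the sketch below enumerates the parameter grid listed above. It assumes that "7, 12, ..., 22" means a step of 5 for the language model weight and that the full cross product is explored; the text does not say whether the search was exhaustive, and the names `parameter_grid` and the dictionary keys are ours, not the paper's.

```python
from itertools import product

# Candidate values copied from the list above (step sizes are our reading of the ellipses).
model_orders = [1, 2, 3]                           # M: triphones, quinphones, 7-phones
am_weights = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]        # acoustic model interpolation weight
lm_weights = [7, 12, 17, 22]                       # language model weight
backoff_weights = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]   # per-frame back-off weight


def parameter_grid():
    """Yield every (M, lambda, w_LM, f_bo) combination as a dictionary."""
    for m, lam, w_lm, f_bo in product(
        model_orders, am_weights, lm_weights, backoff_weights
    ):
        yield {"M": m, "lambda": lam, "w_LM": w_lm, "f_bo": f_bo}


grid = list(parameter_grid())
print(len(grid), "settings per training regime")
```

Each setting would then be scored on the development data, and the best one carried over to the test set.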

Across all experiments reported in this section we kept the following constant: 

• the ML baseline AM is trained on the ML baseline data, 

• minimum number of frames for an M-phone state N_min is 4k, except for one experimental condition 
setting it to 18k to compare against the ML baseline AM, see Table II; 

• maximum number of frames (reservoir size) N_max for an M-phone state is 256k: 

- for the α = 0.3 and β = 2.2 varmix setting this means a maximum number of 92 mixture 
components per state; 

- for the α = 0.7 and β = 0.1 varmix setting this means 620 mixture components per state; the 
GMM splitting becomes very slow for such large numbers of mixture components, so we only 
trained M = 1 models for this setting. 
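
The mixture counts quoted for the two varmix settings can be checked with a short calculation. The sketch below assumes the varmix rule has the power-law form β · n^α (that is, log(components) = log β + α · log n); this form is our assumption about the referenced equation, chosen because it reproduces the quoted counts to within rounding. The function name `varmix_components` is hypothetical.

```python
def varmix_components(num_frames: int, alpha: float, beta: float) -> float:
    """Mixture count under an assumed power-law varmix rule: beta * n ** alpha."""
    return beta * num_frames ** alpha


# "18k" and "256k" read as decimal thousands, plus 256k read as 2**18 frames.
for label, n in [("18k", 18_000), ("256k", 256_000), ("2**18", 2 ** 18)]:
    small = varmix_components(n, alpha=0.3, beta=2.2)
    large = varmix_components(n, alpha=0.7, beta=0.1)
    print(f"n = {label:>6}: {small:6.1f} components (0.3, 2.2), {large:6.1f} (0.7, 0.1)")
```

Under this assumption the α = 0.3, β = 2.2 setting gives roughly 42 and 92 components at the two thresholds, and the 620 figure for α = 0.7, β = 0.1 is matched most closely when 256k is read as 2^18 frames.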
TABLE I 

Maximum Likelihood Back-off Acoustic Model (BAM) Results on the Development Set, 10-best 
Rescoring, in Various Training and Test Regimes. 

Model | WER (S/D/I) (%) | No. Gaussians 

TRAINING DATA = ML baseline data (1.3k hours) 
ML baseline AM, λ = 0.0, w_LM = 17 | [illegible] | 327k 
ML baseline AM, λ = 0.6, w_LM = 17 | [illegible] | 327k 
ML baseline AM, λ = 1.0 (first-pass), w_LM = 17 | [illegible] | 327k 

TRAINING DATA = 100% filtered logs data (87k hours) 
[one or more rows illegible] 
BAM N_min = 4k, α = 0.3, β = 2.2, …, λ = 0.0, w_LM = 17, f_bo = … | 18.0 (2.0/4.2/11.8) | 41 899k 
BAM N_min = 4k, α = 0.3, β = 2.2, … | … | 41 899k 
BAM N_min = 4k, α = 0.3, β = 2.2, M = 3, λ = 0.6, w_LM = 17, f_bo = 0.2 | 16.8 (2.0/3.8/11.0) | 41 899k 
BAM N_min = 4k, α = 0.3, β = 2.2, M = 3, λ = 0.6, w_LM = 17, f_bo = 0.4 | 16.8 (2.0/3.8/11.0) | 41 899k 
BAM N_min = 4k, α = 0.3, β = 2.2, M = 3, λ = 0.6, w_LM = 17, f_bo = 0.6 | 16.8 (2.0/3.8/11.0) | 41 899k 
BAM N_min = 4k, α = 0.3, β = 2.2, M = 3, λ = 0.6, w_LM = 17, f_bo = 0.8 | 16.9 (2.0/3.8/11.1) | 41 899k 
BAM N_min = 4k, α = 0.3, β = 2.2, M = 3, λ = 0.6, w_LM = 17, f_bo = 1.0 | 16.9 (2.0/3.8/11.1) | 41 899k 


1) Development Set Results: Table I shows the most relevant results when rescoring 10-best lists with 
BAM in the log-linear interpolation (4); S/D/I denotes Substitutions, Deletions and Insertions, respectively. 

We built and evaluated models for M = 1, 2, ..., 5 but, as the results in Table I show, there is no 
gain in performance for values of M > 2; since training such models is expensive, we stopped early on 
experimenting with models at M = 4, 5, and as such Table I reports results for M = 1, 2, 3 only. 

The first three rows show the performance, and size (in number of Gaussians), of the ML AM baseline 
(stage 3 in Section IV-B) on the development set DEV. Somewhat surprisingly, there is a small gain 
(0.3% absolute) obtained by interpolating the first and second pass scores produced by the ML baseline 
AM for the same utterance, as well as a loss of 0.3% absolute when the N-best list is rescored with the 
same AM. We point out this oddity because the same second pass alignments are rescored with BAM, 
and hence this small improvement should not be credited to better modeling using BAM, but rather to 
re-computation of alignments in the second pass for each N-best hypothesis individually. This discrepancy 
could be due to one or more possible sources of mismatch between the first-pass system and the N-best 
list rescoring one*: 

• different frame level alignments for the same word hypothesis. This could happen due to the fact that 
the rescoring system uses extremely wide beams when computing the alignment for each hypothesis 
in the N-best list, as well as the fact that some optimizations in the generation of the first-pass static 
CLG FST network may not be matched when aligning a given hypothesis W using H ∘ C ∘ L ∘ W; 

• slightly different acoustic model settings in computing the log-likelihood of a frame; 

• slightly different front-end configurations between first-pass and rescoring. 

The per-frame back-off (2) does not make any difference at all for M = 1, 2 models (we do not include 
the results in Table I since they are identical to those obtained under the f_bo = 0.0 condition), and has 
a minimal impact on the M = 3 model. 

Another point worth making is that BAM stands on its own, at least in the N-best list rescoring 
framework investigated: comparing the rows for λ = 0.0, we observe that BAM improves over the ML 
baseline AM for all values M = 1, 2, 3, with the optimal value being M = 2. 

Finally, the fact that the larger M = 3 value does not improve performance over the M = 2 model 
despite the availability of data to estimate GMMs reliably is an interesting result in its own right, 
suggesting that simply increasing the phonetic context beyond quinphones may in fact weaken the acoustic 
model. 

2) Test Set Results: Table II shows the results of rescoring 10-best lists with the BAM in the log-linear 
interpolation setup of (4), along with the best settings as estimated on the development data. 

The first training regime for BAM used the same training data as that used for the ML part of the 
baseline AM training sequence. When matching the threshold on the minimum number of frames to the 
threshold used for the baseline AM (18k), BAM ends up with fewer Gaussians than the baseline AM: 223k 
vs. 327k. This is not surprising, since no DT clustering is done, and the data is not used as effectively: 
many triphones (i.e., M = 1) are discarded, along with their data. However, its performance matches that 
of the baseline AM in a 10-best list rescoring setup; no claims are made about the performance of such 
a model in the first pass. Lowering the threshold on the minimum number of frames to 4k (26 mixture 
components at α = 0.3, β = 2.2) does increase the number of Gaussians in the model to 490k. 

*We tried our best to minimize this discrepancy, but given the many parameter settings in an ASR system this task has proven 
to be very difficult. The small difference reported was the best we could achieve after spending a significant amount of time on 
this issue. 

TABLE II 

Maximum Likelihood Back-off Acoustic Model (BAM) Results on the Test Set TEST, 10-best List 
Rescoring, in Various Training Regimes. 

Model | WER (S/D/I) (%) | No. Gaussians 

[earlier rows missing from the source] 

TRAINING DATA = 100% filtered logs data (87k hours) 
BAM N_min = 4k, α = 0.3, β = 2.2, M = 2, λ = 0.6, w_LM = 17, f_bo = 0.0 | 10.6 (1.0/2.2/7.4) | … 
BAM N_min = 4k, …, w_LM = 17, f_bo = 0.0 | 10.6 (1.2/2.0/7.3) | 14 435k 


The second training regime for BAM uses the filtered logs data, in varying amounts: 1%, 10%, 100%, 
respectively. A surprising result is that switching from manually annotated data to the same amount 
of confidence filtered data provides a small absolute WER gain of 0.1-0.2%. This suggests that the 
confidence filtered data is just as good as the manually annotated data for training acoustic models that 
are used in an N-best list rescoring pass. 

From then on, BAM steadily improves as we add more filtered logs training data in both the α = 0.3, 
β = 2.2 and α = 0.7, β = 0.1 setups, respectively: the first ten-fold increase in training data brings a 
0.4-0.5% absolute WER reduction, and the second one brings another 0.3% absolute WER reduction. 
As shown in Fig. 2, the WER decreases almost linearly with the logarithm of the training data size. 

The BAM WER gain amounts to 1.3% absolute reduction (11% relative) on the one-pass baseline 
of 11.9% WER. Comparing the baseline results when using ML and bMMI models, respectively (see 
Tables II and III), we note that BAM does not fully close the 18% relative difference between the ML and 
the bMMI first-pass AMs performance, which leaves open the possibility that a discriminatively trained 
BAM would yield additional accuracy gains. 

Fig. 2. ASR Word Error Rate as a Function of Training Data Size, for two different BAM configurations. The WER decreases 
almost linearly with the logarithm of the training data size, and it is marginally influenced by the BAM order (context size). 

As Fig. 3 shows, the best predictor for model performance is the number of mixture components, 
which is consistent with the results on development data, and across the two different α, β settings we 
experimented with. The best model order M is between M = 1 and M = 3 (depending on the maximum 
number of mixtures/state allowed in the model). In fact, with enough mixtures per M-phone, triphones 
(i.e., M = 1) perform just as well as quinphones (i.e., M = 2) or 7-phones. 

D. N-best List Rescoring Experiments using bMMI Baseline AM 



When switching to using the bMMI AM (stage 4 in Section IV-B) as the first-pass model in both 
training and test, the baseline results are significantly better, see Table III. Despite the fact that it is not 
discriminatively trained, BAM still provides 0.6% absolute (6% relative) reduction in WER. 

Fig. 3. ASR Word Error Rate as a Function of Model Size, for two different BAM configurations. The WER decreases almost 
linearly with the logarithm of the model size (measured in number of Gaussians), and it is marginally influenced by the BAM 
order (context size). 
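
The relative figures quoted in Sections IV-C and IV-D follow directly from the absolute WER values. The sketch below recomputes them; it takes 11.9% and 10.6% for the ML first pass and ML plus BAM, and 9.2% for bMMI plus BAM, and it assumes a bMMI first-pass WER of 9.8%, which is the value implied by the reported 0.6% absolute gain. The helper name `relative_reduction` is ours.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, expressed in percent of the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


ml_first_pass, ml_plus_bam = 11.9, 10.6      # ML first pass, and after BAM rescoring
bmmi_first_pass, bmmi_plus_bam = 9.8, 9.2    # 9.8 is implied by the 0.6% absolute gain

bam_over_ml = relative_reduction(ml_first_pass, ml_plus_bam)
bam_over_bmmi = relative_reduction(bmmi_first_pass, bmmi_plus_bam)
bmmi_over_ml = relative_reduction(ml_first_pass, bmmi_first_pass)

print(f"BAM over ML first pass:   {bam_over_ml:.1f}% relative")
print(f"BAM over bMMI first pass: {bam_over_bmmi:.1f}% relative")
print(f"bMMI over ML first pass:  {bmmi_over_ml:.1f}% relative")
```

Rounded to whole percentages, these are the 11%, 6% and 18% relative differences cited in the text.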

E. M-phone Hit Ratios and Other Training Statistics 

Similar to n-gram language modeling, we can compute M-phone hit ratios at various orders: the 
percentage of M-phones encountered in the test data (10-best hypotheses), with left, right context of 
length l, r, respectively, that are also present in the model (and thus there is no need to back off further); 
Table IV shows the values for BAM trained on the filtered logs data (87 000 hours). M-phones at query 
boundaries do not have symmetric context, which explains the non-zero off-diagonal values. The maximal 
order M-phones (sum across last row and column) amount to 42.3% of the total number of M-phones 
encountered on 10-best list rescoring, with 23.6% at the highest order 3, 3. 

We also note that only on 1.1% of test segments do we back off out of the M-phones stored in BAM, 
and use the GMM stored with the clustered state in the first-pass AM. This shows convincingly that 
as both the amount of training data and the model size increase, the DT clustering of triphone states 
is no longer necessary as a means to cope with triphones that are unseen or have too little training 
data. Tables V and VI show the distribution of Gaussian mixtures, and M-phone types at various orders, 
respectively. The total number of Gaussian mixtures in the model is 41 898 799, and the total number of 
M-phone types is 1 146 359, achieving our goal of scaling the AM 100 times larger than the size of the 
first-pass AM, which consists of 0.3 million Gaussians and about 8k context-dependent states. 

TABLE III 

Discriminative (Boosted-MMI) Acoustic Model Baseline Results and BAM Performance on the Test Set 
TEST, 10-best List Rescoring. 

Model | WER (S/D/I) (%) | No. Gaussians 

TRAINING DATA = ML baseline data (1.3k hours) + 10k hours filtered logs data 
bMMI baseline AM, λ = 0.0, w_LM = 17 | 10.2 (1.1/1.7/7.4) | 327k 
bMMI baseline AM, λ = 0.6, w_LM = 17 | 9.7 (1.1/1.6/7.0) | 327k 
bMMI baseline AM, λ = 1.0 (first-pass), w_LM = 17 | 9.8 (1.1/1.6/7.1) | 327k 

TRAINING DATA = 100% filtered logs data (87k hours) 
BAM N_min = 4k, α = 0.3, β = 2.2, M = 3, λ = 0.8, w_LM = 17, f_bo = 0.0 | 9.2 (1.0/1.6/6.7) | 40 360k 

TABLE IV 

M-phone Hit Ratios on 10-best Hypotheses for Test Data for BAM Using M = 3 (7-phones) Trained on the 
Filtered Logs Data (87 000 hours) 

left, right context size | 0 | 1 | 2 | 3 
0 | 1.1% | 0.1% | 0.2% | 4.3% 
1 | 0.1% | 26.0% | 0.9% | 3.4% 
2 | 0.7% | 0.9% | 27.7% | 2.2% 
3 | 3.8% | 2.9% | 2.0% | 23.6% 
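
The summary figures quoted for the hit ratios can be recomputed from the entries of Table IV. The sketch below is only a consistency check; it assumes that the unlabeled first row and column of the table correspond to context size 0, and the variable names are ours.

```python
# Hit ratios (%) from Table IV; rows and columns are assumed to be context sizes 0..3.
hit_ratio = [
    [1.1, 0.1, 0.2, 4.3],
    [0.1, 26.0, 0.9, 3.4],
    [0.7, 0.9, 27.7, 2.2],
    [3.8, 2.9, 2.0, 23.6],
]

total = sum(sum(row) for row in hit_ratio)
last_row = sum(hit_ratio[3])
last_col = sum(row[3] for row in hit_ratio)
max_order = last_row + last_col - hit_ratio[3][3]   # count the (3, 3) cell only once
symmetric = sum(hit_ratio[i][i] for i in range(4))  # equal left and right context

print(f"total = {total:.1f}%, maximal order = {max_order:.1f}%, symmetric = {symmetric:.1f}%")
```

The entries sum to about 100%, and the last row and column together come to about 42%, in line with the 42.3% quoted above up to rounding of the individual cells.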



F. Data Flow in Training MapReduce 

The filtered logs training data consists of approximately 110 million Voice Search spoken queries, or 
87 000 hours of speech, or 31.5 billion frames; on disk it is stored as compressed SSTables at around 
5.7 TB. The output of the Map phase consists of about 8.54 TB of uncompressed data, which is processed 
by the Reduce function. 

February 6, 2013 DRAFT 

JOURNAL OF IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 

TABLE V 

Number of Gaussian Mixtures at Various M-phone Orders for BAM Using M = 3 (7-phones) Trained on 
the Filtered Logs Data 

left, right context size | 0 | 1 | 2 | 3 
0 | - | 9802 | 138 694 | 776 488 
1 | 9619 | 3 193 495 | 581 846 | 1 072 242 
2 | 143 940 | 613 401 | 17 519 632 | 1 134 640 
3 | 843 282 | 1 274 683 | 1 127 789 | 13 459 246 

TABLE VI 

Number of M-phone Types at Various Orders for BAM Using M = 3 (7-phones) Trained on the Filtered 
Logs Data 

left, right context size | 0 | 1 | 2 | 3 
0 | - | 114 | 1902 | 14 384 
1 | 115 | 55 551 | 11 942 | 27 426 
2 | 2124 | 12 673 | 491 528 | 32 248 
3 | 15 858 | 33 035 | 32 717 | 414 742 

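
As a sanity check on Tables V and VI, the per-order counts can be summed and compared against the totals quoted in Section IV-E (41 898 799 Gaussians and 1 146 359 M-phone types). The sketch below assumes the empty cell for context size 0, 0 contributes nothing, since those segments fall back to the first-pass AM; the variable names are ours.

```python
# Table V: number of Gaussians per (left, right) context size; the empty (0, 0) cell is taken as 0.
gaussians = [
    [0, 9_802, 138_694, 776_488],
    [9_619, 3_193_495, 581_846, 1_072_242],
    [143_940, 613_401, 17_519_632, 1_134_640],
    [843_282, 1_274_683, 1_127_789, 13_459_246],
]

# Table VI: number of M-phone types per (left, right) context size.
mphone_types = [
    [0, 114, 1_902, 14_384],
    [115, 55_551, 11_942, 27_426],
    [2_124, 12_673, 491_528, 32_248],
    [15_858, 33_035, 32_717, 414_742],
]

total_gaussians = sum(map(sum, gaussians))
total_types = sum(map(sum, mphone_types))
avg_components = total_gaussians / total_types

print(total_gaussians, total_types, round(avg_components, 1))
```

Both sums agree with the quoted totals, and the model averages a little over 36 Gaussians per M-phone type.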

Of the total of 184 million M-phones encountered in the training data (including back-offs), only one 
million pass the lower threshold on the number of frames (4k); of those, approximately 36k (3.6%) have 
more frames than the upper threshold (256k), and are estimated using reservoir sampling. 
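
For readers unfamiliar with reservoir sampling, a minimal sketch of the standard single-pass algorithm (often called Algorithm R) is given below. It illustrates the general technique only; the function name, the `rng` argument and the toy stream are ours, and the paper's actual implementation inside the Reduce function may differ.

```python
import random


def reservoir_sample(frames, capacity, rng=None):
    """Keep a uniform random sample of at most `capacity` items from a stream."""
    rng = rng or random.Random()
    reservoir = []
    for seen, frame in enumerate(frames, start=1):
        if len(reservoir) < capacity:
            # Fill the reservoir with the first `capacity` items.
            reservoir.append(frame)
        else:
            # Item number `seen` replaces a stored item with probability capacity / seen.
            slot = rng.randrange(seen)
            if slot < capacity:
                reservoir[slot] = frame
    return reservoir


# Toy usage: cap a stream of 1000 "frames" at 256, mimicking the 256k upper threshold.
sample = reservoir_sample(range(1000), capacity=256, rng=random.Random(0))
print(len(sample))
```

The appeal in this setting is that the number of frames for an M-phone does not need to be known in advance, and memory per state is bounded by the reservoir size.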

Most time is spent during GMM splitting in the Reduce phase. The estimation takes about 48 hours 
on 1000 Reduce workers; at half-time, approximately 10% of the Reduce partitions are still being worked 
on: since we need to use our own partitioning function, the Reduce partitions are fairly uneven, with the 
largest partition being about 70 GB (a lot of the data sent to the largest 3-5 Reduce partitions is silence 
frames), and the smallest about 2 GB. 

The size on disk for the largest models we built is about 30 GB. For N-best list rescoring we load the 
generated data into an in-memory key-value serving system with 100 servers, each holding 1/100 of the 
model stored uncompressed for faster look-up. 

G. Validation Setup 

To verify the correctness of our implementation we set up a validation training and test bed: 

• train on the development set, keeping all the data and M-phones by setting the minimum number 
of frames N_min = 1; 

• test on the subset of the development data with 0% WER at 10-best (the manual transcription is 
among the 10-best hypotheses extracted in the first-pass). This is a significant part of the development 
set, approximately 80% (21 751/27 273) of the utterances. The first-pass AM achieves 7.6% WER on this 
subset; 

• minimize the effect of the language model and first-pass AM scores by setting w_LM = 0.1, λ = 0.0 
in (3), (4), and use only the AM score assigned by BAM. 
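
The selection of the validation test subset can be sketched as a simple filter over the N-best lists. The data layout below (a dictionary with `reference` and `nbest` fields) and the toy utterances are hypothetical, chosen only to illustrate the criterion; the last line recomputes the quoted fraction of the development set.

```python
def zero_oracle_wer_subset(utterances):
    """Keep utterances whose reference transcript appears among the N-best hypotheses."""
    return [u for u in utterances if u["reference"] in u["nbest"]]


# Toy data (hypothetical): only the first utterance has its reference in the N-best list.
toy = [
    {"reference": "call mom", "nbest": ["call tom", "call mom"]},
    {"reference": "weather today", "nbest": ["whether today", "weather to day"]},
]

subset = zero_oracle_wer_subset(toy)
coverage = 21_751 / 27_273   # fraction of the development set retained, from the counts above

print(len(subset), f"{coverage:.1%}")
```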

The intent behind choosing this setup is that as the order M increases, BAM should "memorize" the 
alignments on the training set (even M-phones with a single training frame are retained in the model), 
and severely penalize mismatched alignments from N-best competitors to the correct transcript at test 
time. It is for this reason that we choose to test on the subset which contains the correct transcription 
(used in training) in the N-best list. 

The results are presented in Table VII for various context types, and model orders M. A surprising 
result is the fact that a triphone equivalent BAM (M = 1) that does not use word boundary information is 
significantly weaker than its counterpart that uses that information. Increasing the model order improves 
performance in both context settings. The residual WER is due to homophones. 

TABLE VII 

Word-Error-Rates in Validation Setup, Using Various Context Types as Well as Model Orders M 

Context type | M | WER (%) 
CI phones | 1 | 4.5 
CI phones | 5 | 1.5 
+ word boundary | 1 | 1.8 
+ word boundary | 5 | 0.6 

V. Conclusion and Future Work 

We find these results very encouraging, proving that large scale distributed acoustic modeling has the 
potential to greatly improve the quality of ASR. Expanding phonetic context is not really productive: 
"more model" by increasing M > 2 yields no gain in accuracy, so we still need to find alternative ways 
to fully exploit the large amounts of data that are now available. The best predictor for ASR performance 
is the model size, as measured by the number of Gaussians in the model. 

State clustering using DTs as a means of coping with data sparsity may no longer be necessary: only 
1.1% of the state-level segments on test data alignments back off to the DT clustered GMM. It remains to 
be seen if DTs have other modeling advantages: since we used a rescoring framework, and the first-pass 
alignments are generated with an AM that uses DT state clustering, it is still very much part of the core 
modeling approach presented. In addition to that, our best results are obtained by interpolating BAM 
with the baseline AM. 

Obvious future work items that are perfectly feasible at this scale include: DT state tying, re-computing 
alignments in BAM ML training, and discriminative GMM training. 

Another possible direction exploits the BAM ability to deal with large phonetic contexts, and large 
amounts of data. In the early stages of this work we successfully built BAMs with M = 5, but their 
performance on development data did not justify experimenting further with such large models. It would 
be interesting to build BAMs by starting from the surface form of words (using letters as context elements 
instead of phones) and inferring the HMM topology for each unit in a purely data-driven manner, along 
the lines described in Section 3.6 of [20]. 

It seems that we literally have more data than we know what to do with, and better modeling techniques 
at large scale are needed. Non-parametric modeling techniques may be well suited to taking advantage 
of such large amounts of data. 

Appendix A 

How Much Data is Needed to Estimate a Gaussian Well? 

Consider $n$ i.i.d. samples $X_1, \ldots, X_n$ drawn from a normal distribution $\mathcal{N}(\mu, \sigma^2)$. We would like an 
upper bound on the probability that the sample mean estimate $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ is more than $q \cdot \sigma$ away 
from the actual mean, with $q \in (0, 1)$. 

If $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ then $(\bar{X} - \mu)/(\sigma/\sqrt{n}) \sim \mathcal{N}(0, 1)$. Thus 
$P(|\bar{X} - \mu| > q \cdot \sigma) = P(|Z| > q \cdot \sqrt{n}) = 2 \cdot \Phi(-\sqrt{n} \cdot q)$, where $Z$ is the standard normal, i.e. $Z \sim \mathcal{N}(0, 1)$, 
and $\Phi$ is the cumulative distribution function (CDF) for a standard normal random variable. Thus 
$P(|\bar{X} - \mu| > q \cdot \sigma) < p$ is equivalent to choosing $n$ such that $2 \cdot \Phi(-\sqrt{n} \cdot q) < p$. In Matlab this can be 
easily calculated as n = (icdf('Normal', (1-p/2), 0, 1)/q)^2 and in R as n = (qnorm(1-p/2)/q)^2; see 
Table VIII for a few sample values. 




TABLE VIII 

Number of Samples n Needed to Estimate the Mean of a Normal Distribution within $q \cdot \sigma$ of the Actual 
Mean, with Probability Lower than p 

p | q | n 
0.05 | 0.05 | 1537 
0.06 | 0.06 | 983 
0.07 | 0.07 | 670 
0.08 | 0.08 | 479 
0.10 | 0.10 | 271 
0.15 | 0.15 | 95 



A "good" value for the sample size is n = 300, . . . , 1000. We also note that if the sample size is this 
large then the statement will still hold approximately true even if the population is not normal, since by 
the central limit theorem X will be very close to normal even if the population is not. 
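
For convenience, the same calculation can be written in Python using only the standard library. This is a direct transcription of the Matlab and R one-liners above, not additional analysis; the function name `samples_needed` is ours, and rounding to the nearest integer is our assumption about how the table entries were obtained.

```python
from statistics import NormalDist


def samples_needed(p: float, q: float) -> float:
    """n for which P(|sample mean - mu| > q * sigma) equals p, assuming normal data."""
    z = NormalDist().inv_cdf(1.0 - p / 2.0)   # upper p/2 quantile of the standard normal
    return (z / q) ** 2


for p in (0.05, 0.06, 0.07, 0.08, 0.10):
    print(f"p = q = {p:.2f}: n = {round(samples_needed(p, p))}")
```

With rounding to the nearest integer this reproduces the first five rows of Table VIII. For the last row (p = q = 0.15) the same formula gives roughly 92 rather than the tabulated 95, so that entry may have been computed or transcribed differently.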

A similar derivation can be carried out for the sample variance estimate $S^2$: assuming normality, 
$(n - 1) \cdot S^2 / \sigma^2$ follows a $\chi^2$ distribution with $n - 1$ degrees of freedom. 